An article that explore the creation of a age-sex pyramid representing Singapore population data collected on the month of June from year 2000 to 2020 programatically using R
This case study aims to portray the age-sex structural trend that shifts from 2000 to 2020 for the population of Singapore. Different from Take Home Exercise 1, we will learn how to convert the population pyramid static chart created using ggplot2 into an interactive one; using interactive charting packages for R.
More details pertaining to the data and graphs will be explained in later sections.
The aim of this case study explores the creation of a population pyramid representing Singapore population data collected on the month of June from year 2000 to 2020, using R.
The Singapore Residents by Planning Area/Subzone,AgeGroup.Sex and Type of Dwelling Singstat Dataset representing the number of people staying in different regions of Singapore by age cohort, sex and dwelling details. This time, we will be making use of two data sets from the same source that would need to be combined later in order to represent the complete population data set for our case study.
These are namely:
1.Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2000-2010
Notice that these data sets are in separate data frames and there is a need to do some data wrangling to prepare the data for visualization. However, before dwelling into that, we will explore some ideas and inspiration from in-class exercise and previous Take Home Exercise 1 to understand how to create the charts better but first let’s understand the task.
Build an interactive age-sex pyramid representing the demographic structure of Singapore by age cohort and gender that show the trends of the age-sex structure of Singapore population from 2000 to 2020 at planning area level
The Population pyramid is used to present the distribution of different age cohort representing a particular population. It is well known for its pyramid looking shape. Typically it shows a continuous stacked horizontal histogram bar. The population size is on the x-axis while the age-cohort would be presented on the y-axis. The height of each bar typically represents either an absolute frequency or a percentage of the number of people in each age cohort.
Through the population pyramid, we can understand age-sex structure of the Singapore population and identify the population pyramid trend which can unveal things about fertility and motility and whether it is a shrinking population.
Instead of presenting graphs in a fixed frame, we can create interactive charts.
Interactive charts have become a popular way to allow users to explore the visualize and interact with chart elements to get a better sense/context of the graphical message. The good thing about using dash boarding platforms like tableau, is its ability to provide customization interactive capabilities with a few clicks away. This often contain animations and interactive element customization to provide a unique chart exploration experience.
Notice that in R, this capability is not automatically applied to visuals created using the ggplot2 library. Instead, we can utilize interactive plotting elements from other libraries to create these interactive elements.
We need to understand how interactive charts would work in R. There are two known approaches to create interactive charts:
ggplot geometric objects and utilize interactive packages to add on the capabilities interactivityplotlyBefore we begin to select the method, we will sketch out the graph that we are going to make.
These sketches were inspired by interactive capabilities learnt from in class exercise 3 on tableau.
A population pyramid shows the distribution of a population by age group and sex. The term pyramid is used to depict a growing population, which most countries hope to achieve. In the context of Singapore, this population pyramid will depict Singapore’s population distribution and the colors, blue and pink will be used the males and females respectively. Since in this task, we are to provide the pyramid for all planning areas, we will create the drop down menu for users to select the planning area they want to view and a play button to allow users to view the population trend over the years from 2000 to 2020.
Notice from the sketch that we are going to build on to a proposed interactive chart shown in Take Home Exercise 1. The above sketch have improved on the following :
Notice that we have included a tooltip focused on the population value rather than including extra information.
Notice that we have included a dropdown menu that provides the user with a way to filter for planning areas in R (inspired by Tableau filters)
Notice that a timeline player is included to add extra intractivity for users to be able to view the population structure over time from 2000 to 2020
Overall, the focus here is to understand the use of customisation propertises in which we will be building our graph differently from take of Take Home Exercise 1.
This section provides a summary of the packages required for this exercise.
tidyverse : A collection of core pakages designed for data science in R
plotly : Package used to creating interactive web-based graphs via the open source JavaScript graphing library plotly. js
ggiraph : Package used to create dynamic ggplot graphs. This allows you to add tooltips, hover effects and JavaScript actions to the graphics.
patchwork : Package used to combine separate ggplots into the same graphic
gganimate : Package used to include the description of animations
gifski : Package used to convert images to gifs
knitr : Package used for dynamic report generation
kableExtra : Package used for table generation and simple table output designs
The following code chunk will check if the required libraries are installed first before loading them into the R environment.
packages = c('ggiraph', 'plotly',
'DT', 'patchwork',
'gganimate', 'tidyverse',
'readxl', 'gifski', 'kableExtra', 'knitr','crosstalk')
for (p in packages){
if(!require(p,character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The following are the data challenges faced:
Notice that the data representing the population trend from 2000 to 2020 are in separate files. As such, there will be a need to combine these data. We also need to verify that the union of the files if successful
We need to check for abnormal data fields and missing values and exclude them from the data set. We also need to remember to factorize the age group column as we did in Take Home Exercise 1 using factor()
3.Notice that this time, there is a need to be aware of the multiple columns and groups we need to group_by() later on using dyplr package and other data preparation methods to prepare the data
4.Knowing that we have a way to already create a ggplot population pyramid, we need to now think of the geometric customization of the interactive libraries to produce the same graph but with more intuitive feature for interaction and understanding.
The data set is of ‘.csv’ extension which equates to comma separated field format. As such, the read_csv function using the readr library can be used as seen below.
Here are have Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2000-2010
| PA | SZ | AG | Sex | TOD | Pop | Time |
|---|---|---|---|---|---|---|
| Ang Mo Kio | Cheng San | 0_to_4 | Males | HDB 1- and 2-Room Flats | 20 | 2000 |
| Ang Mo Kio | Cheng San | 0_to_4 | Males | HDB 3-Room Flats | 480 | 2000 |
| Ang Mo Kio | Cheng San | 0_to_4 | Males | HDB 4-Room Flats | 220 | 2000 |
| Ang Mo Kio | Cheng San | 0_to_4 | Males | HDB 5-Room and Executive Flats | 80 | 2000 |
| Ang Mo Kio | Cheng San | 0_to_4 | Males | HUDC Flats (excluding those privatised) | 0 | 2000 |
| Ang Mo Kio | Cheng San | 0_to_4 | Males | Landed Properties | 0 | 2000 |
and here we have Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020
| PA | SZ | AG | Sex | TOD | Pop | Time |
|---|---|---|---|---|---|---|
| Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | HDB 1- and 2-Room Flats | 0 | 2011 |
| Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | HDB 3-Room Flats | 10 | 2011 |
| Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | HDB 4-Room Flats | 30 | 2011 |
| Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | HDB 5-Room and Executive Flats | 50 | 2011 |
| Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | HUDC Flats (excluding those privatised) | 0 | 2011 |
| Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | Landed Properties | 0 | 2011 |
Next, we will need to check for missing values and remove them if any. The code below checks for missing values. Notice that both data set do not have missing values.
# check for missing values, rows with more 70% missing
print(pop_2010[rowSums(is.na(pop_2010)) >= 0.7,])
# A tibble: 0 × 7
# … with 7 variables: PA <chr>, SZ <chr>, AG <chr>, Sex <chr>,
# TOD <chr>, Pop <dbl>, Time <dbl>
# A tibble: 0 × 7
# … with 7 variables: PA <chr>, SZ <chr>, AG <chr>, Sex <chr>,
# TOD <chr>, Pop <dbl>, Time <dbl>
Now we will need to check the columns to ensure that both tables contain the same number of columns before we union them together. The code chunk below shows the name of the columns of the two data frames. It seems that they contain the same number of columns and so we will combine them.
[1] "PA" "SZ" "AG" "Sex" "TOD" "Pop" "Time"
[1] "PA" "SZ" "AG" "Sex" "TOD" "Pop" "Time"
# all same + the data type is same
Combining the columns using the union() operation by dplyr will allow us to merge the datasets one under another, merge to the bottom. Following the printed tables, on the top, we have the table where it shows to be from 2000 and the bottom the table where it shows to be 2020.
# union these data set
combined_pop <- union(pop_2010, pop_2020)
| PA | SZ | AG | Sex | TOD | Pop | Time |
|---|---|---|---|---|---|---|
| Ang Mo Kio | Cheng San | 0_to_4 | Males | HDB 1- and 2-Room Flats | 20 | 2000 |
| Ang Mo Kio | Cheng San | 0_to_4 | Males | HDB 3-Room Flats | 480 | 2000 |
| Ang Mo Kio | Cheng San | 0_to_4 | Males | HDB 4-Room Flats | 220 | 2000 |
| Ang Mo Kio | Cheng San | 0_to_4 | Males | HDB 5-Room and Executive Flats | 80 | 2000 |
| Ang Mo Kio | Cheng San | 0_to_4 | Males | HUDC Flats (excluding those privatised) | 0 | 2000 |
| Ang Mo Kio | Cheng San | 0_to_4 | Males | Landed Properties | 0 | 2000 |
| PA | SZ | AG | Sex | TOD | Pop | Time |
|---|---|---|---|---|---|---|
| Yishun | Yishun West | 90_and_over | Females | HDB 4-Room Flats | 60 | 2020 |
| Yishun | Yishun West | 90_and_over | Females | HDB 5-Room and Executive Flats | 20 | 2020 |
| Yishun | Yishun West | 90_and_over | Females | HUDC Flats (excluding those privatised) | 0 | 2020 |
| Yishun | Yishun West | 90_and_over | Females | Landed Properties | 0 | 2020 |
| Yishun | Yishun West | 90_and_over | Females | Condominiums and Other Apartments | 10 | 2020 |
| Yishun | Yishun West | 90_and_over | Females | Others | 30 | 2020 |
To ensure that data have successfully merge, lets ensure that the rows combined are the same.
[1] 2025248
[1] 2025248
# Turned out same
After successfully merging the data sets, lets check if the values of the categorical column and ensure that they are consistent to ensure data quality. The code chunk below would do the former.
# checking combined data for missing values
print(unique(combined_pop$PA)) # Not stated is found, exclude using dyplr mutate
[1] "Ang Mo Kio" "Bedok"
[3] "Bishan" "Boon Lay/Pioneer"
[5] "Bukit Batok" "Bukit Merah"
[7] "Bukit Panjang" "Bukit Timah"
[9] "Central Water Catchment" "Changi"
[11] "Changi Bay" "Choa Chu Kang"
[13] "Clementi" "Downtown Core"
[15] "Geylang" "Hougang"
[17] "Jurong East" "Jurong West"
[19] "Kallang" "Lim Chu Kang"
[21] "Mandai" "Marina East"
[23] "Marina South" "Marine Parade"
[25] "Museum" "Newton"
[27] "North-Eastern Islands" "Novena"
[29] "Orchard" "Outram"
[31] "Pasir Ris" "Paya Lebar"
[33] "Punggol" "Queenstown"
[35] "River Valley" "Rochor"
[37] "Seletar" "Sembawang"
[39] "Sengkang" "Serangoon"
[41] "Simpang" "Singapore River"
[43] "Southern Islands" "Straits View"
[45] "Sungei Kadut" "Tampines"
[47] "Tanglin" "Tengah"
[49] "Toa Payoh" "Tuas"
[51] "Western Islands" "Western Water Catchment"
[53] "Woodlands" "Yishun"
[55] "Not Stated" "Boon Lay"
[57] "Pioneer"
[1] "0_to_4" "5_to_9" "10_to_14" "15_to_19"
[5] "20_to_24" "25_to_29" "30_to_34" "35_to_39"
[9] "40_to_44" "45_to_49" "50_to_54" "55_to_59"
[13] "60_to_64" "65_to_69" "70_to_74" "75_to_79"
[17] "80_to_84" "85_to_89" "90_and_over"
[1] "Males" "Females"
From the above, it seems like *Not Stated* value of the PA (Planning Area) Column show to be rows that can affect the data quality of the visualization and analysis. As such, during the data wrangling stage, we will exclude rows with this value
Also notice that there are many planning areas. We will go ahead and perform another data quality check to ensure that all areas have population values.
select() all planning locationsgroup_by() planning locations and summarised() sum their total population valuesfilter() for all sums that are 0select() only the planning area and finally ungroup()Museum which will be excludedLocation_pop <- combined_pop %>%
select(PA, Pop) %>%
group_by(PA) %>%
summarise(total=sum(Pop)) %>%
filter(total==0) %>%
select(PA) %>%
ungroup()
kable(Location_pop)
| PA |
|---|
| Boon Lay |
| Boon Lay/Pioneer |
| Central Water Catchment |
| Changi Bay |
| Marina East |
| Marina South |
| Paya Lebar |
| Pioneer |
| Simpang |
| Straits View |
| Tengah |
| Tuas |
| Western Islands |
no_pop_locations <- as.vector(Location_pop$PA)
no_pop_locations <- c(no_pop_locations, c("Museum"))
print(no_pop_locations)
[1] "Boon Lay" "Boon Lay/Pioneer"
[3] "Central Water Catchment" "Changi Bay"
[5] "Marina East" "Marina South"
[7] "Paya Lebar" "Pioneer"
[9] "Simpang" "Straits View"
[11] "Tengah" "Tuas"
[13] "Western Islands" "Museum"
These locations will be filtered out in data wrangling. Removing these locations will also help with the interactie chart render time.
Upon successfully checking the quality and preparing the raw data to be wrangled, we will proceed with the data wrangling process. The following code chunk will do the following :
select() : Planning Area, Time, Age group, gender and population columnsgroup_by() : Planning Area, Time, Age group and gender to sum up the population for the respective categoriessummarise() : Summaries these respective groups with the total populationarrange() : Sort the output data frame according to Planning Area, Time, Age group and then genderfilter() : Filter out the *Not Stated* value of the PA (Planning Area) Column as mentioned in the previous section. *%in%* is the same as *in* operator used to check if a column contains a value from a list/vector of string values. ! operator in this case would do the oppositeungroup() : It is good practice to un-grouping your data after every group_by() function# GET DATA
combined_pop_grouped <- combined_pop %>%
select(PA,Time,AG, Sex, Pop) %>%
group_by(PA,Time,AG,Sex) %>%
summarise(Total = sum(Pop)) %>%
arrange(PA,Time,Sex,AG) %>%
filter(PA != 'Not Stated') %>%
filter(!PA %in% no_pop_locations) %>%
ungroup()
# renaming table columns
names(combined_pop_grouped) <- c("Planning_Area","Year","Age_Group","Gender","Population")
Quick check
[1] "Ang Mo Kio" "Bedok"
[3] "Bishan" "Bukit Batok"
[5] "Bukit Merah" "Bukit Panjang"
[7] "Bukit Timah" "Changi"
[9] "Choa Chu Kang" "Clementi"
[11] "Downtown Core" "Geylang"
[13] "Hougang" "Jurong East"
[15] "Jurong West" "Kallang"
[17] "Lim Chu Kang" "Mandai"
[19] "Marine Parade" "Newton"
[21] "North-Eastern Islands" "Novena"
[23] "Orchard" "Outram"
[25] "Pasir Ris" "Punggol"
[27] "Queenstown" "River Valley"
[29] "Rochor" "Seletar"
[31] "Sembawang" "Sengkang"
[33] "Serangoon" "Singapore River"
[35] "Southern Islands" "Sungei Kadut"
[37] "Tampines" "Tanglin"
[39] "Toa Payoh" "Western Water Catchment"
[41] "Woodlands" "Yishun"
Next, the total Population size of males for each group will be multiplied by a factor of -1 to vertically flip the values from the positive x-axis to the negative x-axis.
# All males are negative so they go to the left
combined_pop_grouped$Total_Population <- combined_pop_grouped$Population
combined_pop_grouped$Total_Population <- ifelse(combined_pop_grouped$Gender == "Males"
, -1*combined_pop_grouped$Total_Population
, combined_pop_grouped$Total_Population)
| Planning_Area | Year | Age_Group | Gender | Population | Total_Population |
|---|---|---|---|---|---|
| Yishun | 2020 | 65_to_69 | Males | 6080 | -6080 |
| Yishun | 2020 | 70_to_74 | Males | 3840 | -3840 |
| Yishun | 2020 | 75_to_79 | Males | 1790 | -1790 |
| Yishun | 2020 | 80_to_84 | Males | 1160 | -1160 |
| Yishun | 2020 | 85_to_89 | Males | 510 | -510 |
| Yishun | 2020 | 90_and_over | Males | 180 | -180 |
| Planning_Area | Year | Age_Group | Gender | Population | Total_Population |
|---|---|---|---|---|---|
| Ang Mo Kio | 2000 | 0_to_4 | Females | 4460 | 4460 |
| Ang Mo Kio | 2000 | 10_to_14 | Females | 5520 | 5520 |
| Ang Mo Kio | 2000 | 15_to_19 | Females | 5930 | 5930 |
| Ang Mo Kio | 2000 | 20_to_24 | Females | 7160 | 7160 |
| Ang Mo Kio | 2000 | 25_to_29 | Females | 7750 | 7750 |
| Ang Mo Kio | 2000 | 30_to_34 | Females | 6810 | 6810 |
On the bottom, notice female values for total_population is positive while male is negative (top table) this will be used for the barplot element.
Interactive Charts also often make use of tool tips. As such, we will set up the text that we want to display to users. One thing we need to clean is the *Age Group* column. Notice that there are _ (underscores) that act as spaces between the characters.
kable(combined_pop_grouped$Age_Group[1:5])
| x |
|---|
| 0_to_4 |
| 10_to_14 |
| 15_to_19 |
| 20_to_24 |
| 25_to_29 |
To resolve this, the following code chunk will show how we utilize the sub() function to substitute the _ and create a tool tip column that shows the Gender,Age Group and Population. It will utilize the \n special character to signal for a new line between these fields in the tool tip. Once we have replace the characters, we will factorize the Age Group column; respective age bins (categories) in an ordered manner, with those of ages 0 being the smallest and over 90 being the largest value in the scale.
# Substituting '_'
combined_pop_grouped$Age_Group<-sub('_to_', ' to ', combined_pop_grouped$Age_Group)
combined_pop_grouped$Age_Group<-sub('_and_', ' and ', combined_pop_grouped$Age_Group)
# Creation of the tooltip
combined_pop_grouped$tooltips <- c(paste0("Gender = ", combined_pop_grouped$Gender
, "\n Age Group = ", combined_pop_grouped$Age_Group
, "\n Population = ", combined_pop_grouped$Population ))
# Similarly, like take home excessive one, we will factorize the age group
combined_pop_grouped$Age_Group <- factor(combined_pop_grouped$Age_Group, ordered=TRUE ,levels=c("0 to 4","5 to 9","10 to 14","15 to 19","20 to 24","25 to 29","30 to 34","35 to 39","40 to 44","45 to 49","50 to 54","55 to 59","60 to 64","65 to 69","70 to 74","75 to 79","80 to 84","85 to 89","90 and over"))
While we have removed a number of places with 0 population, we also realized that there are too many planning areas. To provide a higher level visualization to understand population structures over time, we will go ahead and aggregate the data to provide for higher level view of the data set.
Region column to group the different planning areas by their region. We will make use of dyplr case when function through the with function to create the conditions to group the planning areas into regions. The regions of Singapore and hand recorded and found on the Singapore Wiki of Regioncombined_pop_grouped$Region <- NA
combined_pop_grouped$Region <- with(combined_pop_grouped, dplyr::case_when(
Planning_Area %in% c('Central Water Catchment','Lim Chu Kang','Mandai'
,'Sembawang','Simpang','Sungei Kadult','Woodlands','Yishun') ~'North',
Planning_Area %in% c('Ang Mo Kio','Hougang','North-Eastern Islands','Punggol','Seletar','Sengkang','Serangoon')~'North East',
Planning_Area %in% c('Bedok','Changi','Changi Bay', 'Paya Lebar','Pasir Ris',
'Tampines') ~ 'East',
Planning_Area %in% c('Boon Lay','Bukit Batok','Bukit Panjang','Choo Chu Kang', 'Clementi','Jurong East','Jurong West','Pioneer','Tengah','Tuas','Western Islands','Western Water Catchment') ~ 'West',
TRUE ~ 'Central'))
Also notice that the Age-Groups can be further divided into respective groups like the ‘young’ , the ‘working age’ and the ‘aged’. We have introduced young and old in the first take home exercise but for now we will go ahead and segment these age groups into the follow categories based on this paper which is the age classification by Sing stat.
case when function and then use the %in% operator to group the values specific by the respective vectors and call this new column Age_Group_catyoung <- c("0 to 4","5 to 9","10 to 14","15 to 19","20 to 24")
working_age <- c("25 to 29","30 to 34","35 to 39","40 to 44","45 to 49","50 to 54","55 to 59","60 to 64")
aged <- c("65 to 69","70 to 74","75 to 79","80 to 84","85 to 89", "90 and over")
combined_pop_grouped <- combined_pop_grouped %>%
mutate(Age_Group_cat = case_when(
Age_Group %in% young ~ "young",
Age_Group %in% working_age ~ "working_age",
Age_Group %in% aged ~ "aged",
))
head(combined_pop_grouped)
# A tibble: 6 × 9
Planning_Area Year Age_Group Gender Population Total_Population
<chr> <dbl> <ord> <chr> <dbl> <dbl>
1 Ang Mo Kio 2000 0 to 4 Females 4460 4460
2 Ang Mo Kio 2000 10 to 14 Females 5520 5520
3 Ang Mo Kio 2000 15 to 19 Females 5930 5930
4 Ang Mo Kio 2000 20 to 24 Females 7160 7160
5 Ang Mo Kio 2000 25 to 29 Females 7750 7750
6 Ang Mo Kio 2000 30 to 34 Females 6810 6810
# … with 3 more variables: tooltips <chr>, Region <chr>,
# Age_Group_cat <chr>
We will go head and create a data frame where we can view the number of aged and young population. To do so, we need the spread column. Spread() function takes in a key and a value column. First it will spread the keys into respective columns and fill it with the value. This allows us to transform categorical columns into respective columns with values for each unique row. Doing so will allow us to obtain a planning area by year and region with the number of population by the three different age group categories. After which will can then :
group_by() year region and planning yearsummarise() and obtain the sum of population for each age group cat for each of the respective groupsmutate() obtain and overall population column from the sum of the tree. Notice that we did not use total_population because that was not casted into the spread()functioncombined_pop_grouped_inter <- combined_pop_grouped %>%
mutate(i = row_number()) %>%
spread(Age_Group_cat, Population) %>%
select(-i)
# Replace NA with 0 to help with summation in the future
combined_pop_grouped_inter[is.na(combined_pop_grouped_inter)] <- 0
combined_pop_grouped_inter_agg <-combined_pop_grouped_inter %>%
group_by(Year,Region,Planning_Area) %>%
summarise(Aged=sum(aged),Working_Age=sum(working_age),Young=sum(young)) %>%
mutate(Population= Aged + Working_Age + Young)
head(combined_pop_grouped_inter_agg)
# A tibble: 6 × 7
# Groups: Year, Region [1]
Year Region Planning_Area Aged Working_Age Young Population
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2000 Central Bishan 6170 52710 31400 90280
2 2000 Central Bukit Merah 18720 90020 40130 148870
3 2000 Central Bukit Timah 5250 37670 21670 64590
4 2000 Central Choa Chu Kang 5470 77240 54320 137030
5 2000 Central Downtown Core 660 2680 1010 4350
6 2000 Central Geylang 12440 69180 37520 119140
As seen in the Tableau example, obtaining this data set will certainly allow us to obtain a sum young/aged animated chart but it would not show the values do not reflect the ratio it has against the working age group. As such, at times, visualization would require us to engineering certain values to better represent the visuals and help with drilling down our analysis.
We will go ahead and create two ratios that is often used by the (Eurostat)[https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Glossary:Young-age_dependency_ratio] community. These are the Young_Dependency_Ration and the Aged_Dependency_Ratio; which is the ratio of the young/aged group against the working age. This will help us better understand the pressure on the working population. Typically, those not in the labor force against those typically in the labor force.
In the below code chunk:
We will use select() for the respective columns and obtain the sum of the Aged, Working_Age and Young category
group_by() the year and region will obtain the sum of the age categories by year and region
3, Mutate and obtain the dependencies ratio and round it off to 1 decimal place. Followed by an ungroup() function
combined_pop_grouped_inter_agg <- combined_pop_grouped_inter_agg %>%
select(c("Year","Region","Aged","Working_Age","Young","Population")) %>%
group_by(Year, Region) %>%
summarise(Aged=sum(Aged),Working_Age=sum(Working_Age),Young=sum(Young),Population=sum(Population)) %>%
mutate(Aged_Dependency_Ratio= Aged/Working_Age) %>%
mutate(Young_Dependency_Ratio= Young/Working_Age) %>%
mutate(Aged_Dependency_Ratio = round(Aged_Dependency_Ratio*100, digits=1),
Young_Dependency_Ratio = round(Young_Dependency_Ratio*100, digits=1)) %>%
ungroup()
head(combined_pop_grouped_inter_agg)
# A tibble: 6 × 8
Year Region Aged Working_Age Young Population Aged_Dependency_…
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000 Central 105240 612240 318730 1036210 17.2
2 2000 East 40290 371760 244500 656550 10.8
3 2000 North 17910 228050 153170 399130 7.9
4 2000 North … 39130 331880 199290 570300 11.8
5 2000 West 31890 352240 224910 609040 9.1
6 2001 Central 108050 609890 313010 1030950 17.7
# … with 1 more variable: Young_Dependency_Ratio <dbl>
To build the Population Pyramid chart there are a few components and geometric objects we need to be familiar with. The core components come from the plotly and ggplot library. It’s uses in this segment is explained as follows:
plot_ly(): The plot_ly() function provides a ‘direct’ interface to plotly.js with some additional abstractions to help reduce typing
Traces(): Is like layering of the geometric objects. New traces can be added to a plot_ly figure using the add_trace() method.
layout(): Layout is used for styling purposes like that of titles, x and y axis variables.
frame: The frame parameter segments the data by time frame and then creates individual graphs which will be then combined to form the overall animated gif of the motion time frame. In the case of using plotly, it will instead make use of the js property to create this motion time framed graph.
hoverinfo(): Is the tooltip function to create the hover effect of the graph
[crosstalk::bscols]https://search.r-project.org/CRAN/refmans/crosstalk/html/bscols.html): This helper function makes it easy to put HTML elements side by side. It can be called directly from the console but is especially designed to work in an R Markdown document.
crosstalk::SharedData: SharedData is to be passed to Crosstalk-compatible widgets in place of a data frame.
First, we will plot chart to show the trend of the young-age dependency ratio from 2000 to 2020. This will allow us to identify the population trends by regions of Singapore and pick out anomalies. We have first selected our x and y values and then color the fill by region and create the hoverinfo (tooltip information) as the region. We will then use the population as the size to show the different population size in Singapore. We will proceed to frame() the charts into a series of charts created by the year in which plotly() will play it as a timeline. Lastly, we will set up the layout to set the titles and text formats.
Notice how html elements were used to style the chart title and subtitle.
fig <- combined_pop_grouped_inter_agg %>%
plot_ly(
x = ~Aged_Dependency_Ratio,
y = ~Young_Dependency_Ratio,
size = ~Population,
color = ~Region,
frame = ~Year,
text = ~Region,
hoverinfo = "text",
type = 'scatter',
mode = 'markers'
)%>% layout(title="East has the fastest growing aged-dependency ratio trend
<sup>Chart Represents the young-aged dependency ratio for the different regions of Singapore</sup>", xaxis = list(title = list(text ='Age Dependency Ratio')), yaxis = list(title = list(text ='Young Dependency Ratio')))
fig
The above chart shows that the central region have the most number of people staying (most of the percentage of the population reside). However, despite being a larger hub, it still has a growing age dependency ratio. ‘Younger’ regions, like the north east and east particularly have a noticeable fast growth rate of the age-dependency score. It seems like between the year 2014 onwards, the East Region age-dependency score out grown the west, north and north east regions; who were increasing also. This speed is exceptionally different from the other and require more investigation.
To investigate further, lets plot the same chart but with only the east region of Singapore and fill the chart with the planning areas. To do so, we will create a new data frame that calculates the young-age dependency ratio using the same procedure as explained above but this time by regions.
We will then craft the ggplot() object and make use of the facet_wrap() function to create a facet grid plot to show the young-aged dependency ratio of all planning areas by the region. We will also specifc in the aes() a frame=Year; once we output the plot using the ggplotly() function, it will create the interactive chart and read the frame() aesthetics argument as the same frame() argument to create the year timeline.
combined_pop_grouped_inter_agg_PA <- combined_pop_grouped_inter %>%
select(Year, Region ,Planning_Area, aged, young, working_age, Total_Population) %>%
group_by(Planning_Area, Region, Year) %>%
summarise(Aged=sum(aged),Working_Age=sum(working_age),Young=sum(young),Population=sum(Total_Population)) %>%
mutate(Aged_Dependency_Ratio= Aged/Working_Age) %>%
mutate(Young_Dependency_Ratio= Young/Working_Age) %>%
mutate(Aged_Dependency_Ratio = round(Aged_Dependency_Ratio*100, digits=1),
Young_Dependency_Ratio = round(Young_Dependency_Ratio*100, digits=1)) %>%
#filter(Region == "East") %>%
ungroup()
g<-combined_pop_grouped_inter_agg_PA%>%
ggplot(aes(x = Young_Dependency_Ratio, y = Aged_Dependency_Ratio, label=Planning_Area, color = Region, frame=Year))+
geom_point()+
facet_wrap(vars(Region)) +
labs(title = "Planning Areas have an increasing Aged Dependency Ratio Trend", subtitle = "Regions of Singapore Young-Age Dependency Ratio by Planning Area", x="Young Dependency Ratio" , y="Aged Dependency Ratio")
ggplotly(g)
Before interpreting the chart, we can further give the situation context by referring to the mature and non mature estate list to help us interpret the result.
Notice, form the above, that east only has four estates. With all four as a mature estate, this makes sense for the aged dependency to grow faster than other locations where more non-mature estate fall in like the North East; as there is not much newer housing for younger couples (BTOs). For example, North East show interesting patterns like Hougang and Seletar with higher young depdency ratio at first to higher aged dependency score.
To inestigate the East, we will drill down to one location for simplicity, that would be Bedok. Bedok, despite being a mature estate, accelerated in aged dependency score from 2014 onwards. We will investigate further into the age groups living in Bedok and how they the age-sex structure change over time.
To further dive into the problem, we can create an age-sex pyramid / population pyramid of seletar and yet provide interactive charting elements for users to be able to freely explore the data. The solution uses plotly library to build with elements explained earlier. The only difference is generating the update menu that will filter() the data set. It was done using crosstalk package to provide a highlighted data for sharing. The drop down menu will filter for shared data frame that will be passed to the plotly geometric object to plot the population pyramid. Also note, that for the axis range, we will make use of plyr signif function to round of the calculated axis range to the nearest thousand and use an appropriate range value to segregate.
Areas <- unique(combined_pop_grouped$Planning_Area)
get_tick_text <- function(axis_range) {
round_axis_range <- signif(axis_range, digits = 1)
text_val <- seq(round_axis_range, 0, -2000)
t_text <- c(text_val, text_val)
return (t_text)
}
get_tick_val <- function(axis_range) {
round_axis_range <- signif(axis_range, digits = 1)
t_val <- c(seq(round_axis_range, 0, -2000), seq(-round_axis_range, 0, 2000))
return (t_val)
}
mrg <- list(l = 50, r = 50,
b = 50, t = 50,
pad = 20)
generate_graph <- function(axis_range, t_text, t_val) {
fig <- plot_ly(sdf,
x = ~Total_Population,
y = ~Age_Group,
type = 'bar',
hovertext = ~Population,
hoverinfo = 'text',
color = ~Gender,
colors = c("powderblue", "pink")
) %>% layout(
title="Most planning areas in Singapore follow a constrictive trend
<sup>Interactive population pyramid showing age-sex strcutural trend from 2000 to 2020</sup>",
bargap = 0.1,
barmode = 'overlay',
yaxis = list (title = 'Age Group'),
xaxis = list(range=c(-axis_range, axis_range),
tickvals= t_val,
ticktext= t_text,
title='Total Population'),
margin = mrg
)
fig
}
generate_graph_output <- function() {
axis_range <- 14000
t_text <- get_tick_text(axis_range)
t_val <- get_tick_val(axis_range)
generate_graph(axis_range, t_text, t_val)
}
# Wrap data frame in SharedData
sdf <- SharedData$new(combined_pop_grouped)
# Use SharedData like a dataframe with Crosstalk-enabled widgets
bscols(
widths = c(2,NA),
list(
filter_select("Planning_Area", "Area", sdf, ~Planning_Area, multiple = FALSE),
filter_select("Year", "Year", sdf, ~Year, multiple = FALSE)
),
# Create a filter input
#data
generate_graph_output()
)
Selecting Bedok, we can see that from 2014 to 2020, there is a shift in the older age groups upwards and yet the younger age groups seem to drop. This shows that there is a shrinking population trend of Bedok. This is alarming because Bedok while is known to have a larger older generation, it still have many BTO project. It shows that despite these projects, Bedok is still headed towards a shrinking population. In comparison, Hougang, a non-mature estate from 2014 to 2020 have a “slower shrinking population”.
References : Geogrpahical Ditribution of Singapore
We can further appreciate elements of visualization using a geom_point object to outlay the population pyramid as seen below :
bedok_df <- combined_pop_grouped %>%
filter(Planning_Area == "Bedok") %>%
dplyr::select(Year,Age_Group,Gender, Population,Total_Population, tooltips)
ggplot(bedok_df, aes(x = Total_Population, y = Age_Group)) +
geom_point(alpha = 0.7,
show.legend = TRUE,
aes(color = factor(Gender))) +
scale_colour_manual(values = c("#E77878","#4682B4")) +
scale_size(range = c(2, 12)) +
labs(title = 'Population Pyramid trend of Bedok of Year: {frame_time}',
x = 'Total Population',
y = 'Age Group') +
transition_time(as.integer(Year)) +
ease_aes('linear') +
theme(panel.background = element_rect(fill = 'white', colour = 'white'), panel.grid = element_blank()) +
labs(colour = "Gender") +
theme (legend.title = element_text(size=10, face="bold"),legend.key = element_rect(fill = 'white', colour = 'white') )
The above code chunk users these few new elements which was not introduced before. These elements uses the gg-animate package to create the GIF for the motion chart to represent the age-sex structure for each year. This is done using:
transition_time (https://gganimate.com/reference/transition_time.html): This is a variant of transition_states() that is intended for data where the states are representing specific point in timeease_aes (default is linear) : The ease_aes() function controls the easing of aesthetics or variables in gganimateThere are other transition parameters and function to use but it would be necessary to customize too much of the timing and frame structure.
Notice how we used the trend line of the dots to help create a visual effect of the trend changing over the years. Contrast of the points colors is also used against a white background to highlight the moving trend. At the same time keeping the titles and text in consistent colors and spacing to avoid distraction.
Population pyramids are important graphical representation to understand the composition of population members.It is typically visualized by grouping the population members into age cohorts and further diving the data points into their respective gender groups. In other words, the age-sex structure of specific populations. This makes it easy for demographers to compare the difference between male and female populations and the structure of the population at any given moment. Demographers typically use this to study the trend of populations relating to the fertility and mortality.
There are three trends in population pyramids they are typically: - expansive - constrictive - stationary
#
We shall focus our efforts in explaining the trend that is reflected in our plot.
The Singapore population trend across all regions for both gender is depicted to follow a constrictive population pyramid trend with its ‘beehive’ shape as explain in take home exercise 1.
In this exercise, we found that the trends of the different regions of Singapore follow a similar trend. Furthermore, it seems like the aged dependency ratio is getting larger for most part of Singapore. We seen how older/mature estates are becoming more aged-dependent and also how non-mature estates despite having slight increased in young-dependency scores, still have a trend moving with higher aged-dependency scores. This sort of reflect the idea of how Singapore is increasing the working age limit since we are becoming for aged dependent.
Animation Capabilities : Tableau provides a lot of animation capabilities in a click. This includes most of the animation aesthetics like customizing the tooltip to include sub charts and the creation of the motion chart done during the in class exercise. The animation, interval and pages are easily incorporated in tableau. One of the hardest thing to create was to tweet the animation settings in R as there is a need to reference the ggiraph library
Customization Capabilities : While Tableau is easy to configure to the desired chart, certain customization abilities is only possible through R. For example, the use of functions to automatically adjust the ticks to detailed levels
“Smoothest” : In Tableau, running interactive charts seem to be smoother and much more responsive. This capability is fallen short in R. The charts seem to have a noticeable lag and producing highly customized charts have shown to slow down R studio a lot. This makes it had to test highly customized charts
Overall, Tableau provides as quick way to do up an already interactive chart. R, on the other hand, provides a lot of customization capabilities and flexibility. The two software have their own peaks at producing interactive charts, however, it would depend on the level of customization needed. Tableau is quick to market while R is highly flexible.
It seems like the adaptation with JavaScript has helped ggplot plots to be adapted to be interactive. It shows the flexibility of creating these charts using the different libraries; each of which the customization of the charts would be different. How we layer and build on to existing geom objects would dependent on the library you use.